R Notebook Questions, refined:

Do more Marylanders donate to in-state or out-of-state candidates? Which out-of-state candidates received the most donations and the greatest amount of money in donations? What are the demographics of people who donate to Maryland races only? To out-of-state races only? (using voter data OR using census data to see if those donors are clustered in specific locations, and, if they are, what the makeups of those locations are) Among top Maryland donors, what professions donate the most money to senate campaigns? What’s the makeup of donations received by Hogan and Alsobrooks? What percentage of their overall donations were large amounts of money (to be defined, but > $1000, for example) vs small amounts? Which party tends to donate more to out-of-state races?

We refined our dataset further by deciding to focus on individual donations made to senate races by donors in Maryland. Using the FEC website, we downloaded a dataset that met those parameters. We cleaned the data by removing unnecessary columns and renaming the columns where needed. We decided to limit our dataset to donations made in 2024 to ensure that the donations were made after the primary and before the general election. We hoped this would limit the number of candidates receiving donations, since most donations will go toward one of two parties in a certain race.

We also cleaned the data-set, removing numerous columns that were N/A or that were unnecessary to our project. We also had to rename one column and then filter to only the report year of 2024 to limit the donations. To answer our first question and questions, we’ll need to separate the out-of-state candidates from in-state candidates, but this should be easy because we’ll just need to filter out the Hogan and Alsobrooks committees to isolate the out-of-state candidates receiving donations. To answer our third question, we’ll need to either call in demographic data from the census or identify specific locations with clusters of donations, and look up information about those locations using census.gov. Question #4 asks about the professions of donors. One obvious challenge here is that professions and titles are often spelled or worded differently, even if they describe the same job– in this dataset, there are 2,738 different professions listed. To try to combat this, we will use open refine to reduce the number of occupations with slight variations in spelling or wording. We will then limit the data set to the top 1000 donations to make the analysis more manageable. We can also make some broad observations about the data even without cleaning the professions– for example, it’s obvious that the most donations come from people who say they are “not employed” (45568) or “retired” (29424). The next highest number of donations is more than 20,000 donations less than the number of donations from retired people. There are 207 committees that received donations in our dataset. Originally, we thought that we would use open refine to sort the committees by candidate name, because we thought that most candidates would have multiple committees. However, we found that this was actually pretty uncommon, and we weren’t able to match any names using open refine, once we uploaded our csv. For question 5, we need to do some extra research to decide how to define large donations. One idea is to find the average or median donation, and use that number as the dividing line between small and large donations. Another idea is to see if there’s an agreed-upon definition by people who work on campaigns or political scientists. We have one extra question (#6) that we would like to answer, but it involves party-level analysis that we can’t do because we don’t have the political parties of the donors in this data, and we don’t have anything in this data that indicates the party of each candidate. To answer this question, we could join this data with data from Act Blue and Win Red to identify the parties of donors. We can also probably find a dataset with the committee names and parties for all senate candidates, and join that dataset with ours.

Turn off scientific notation


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxub3B0aW9ucyhzY2lwZW49OTk5KVxubGlicmFyeSh0aWR5dmVyc2UpXG5saWJyYXJ5KGx1YnJpZGF0ZSlcblxuXG5gYGAifQ== -->

```r
options(scipen=999)
library(tidyverse)
library(lubridate)

```

<!-- rnb-source-end -->
```r
options(scipen=999)
library(tidyverse)
library(lubridate)

<!-- rnb-source-end -->


<!-- rnb-output-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->



<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-output-begin eyJkYXRhIjoiXG48IS0tIHJuYi1zb3VyY2UtYmVnaW4gZXlKa1lYUmhJam9pWUdCZ2NseHVhVzV6ZEdGc2JDNXdZV05yWVdkbGN5Z25kR2xrZVdObGJuTjFjeWNwWEc1Z1lHQWlmUT09IC0tPlxuXG5gYGByXG5pbnN0YWxsLnBhY2thZ2VzKCd0aWR5Y2Vuc3VzJylcbmBgYFxuXG48IS0tIHJuYi1zb3VyY2UtZW5kIC0tPlxuIn0= -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuaW5zdGFsbC5wYWNrYWdlcygndGlkeWNlbnN1cycpXG5gYGAifQ== -->

```r
install.packages('tidycensus')

  There is a binary version available but the source version is later:
installing the source package ‘tidycensus’

trying URL 'https://cran.rstudio.com/src/contrib/tidycensus_1.6.7.tar.gz'
Content type 'application/x-gzip' length 2262396 bytes (2.2 MB)
==================================================
downloaded 2.2 MB

* installing *source* package ‘tidycensus’ ...
** package ‘tidycensus’ successfully unpacked and MD5 sums checked
** using staged installation
** R
** data
*** moving datasets to lazyload DB

* removing ‘/Library/Frameworks/R.framework/Versions/4.2/Resources/library/tidycensus’
* restoring previous ‘/Library/Frameworks/R.framework/Versions/4.2/Resources/library/tidycensus’
Warning in install.packages :
  installation of package ‘tidycensus’ had non-zero exit status

The downloaded source packages are in
    ‘/private/var/folders/2l/xcvc2w597t9714c985cwdbr40000gp/T/RtmpfQuLsL/downloaded_packages’
library(tidycensus)
md_senate_contributions <- read_csv("data/md_senate_contributions.csv") |>


print(column_names)
New names:Warning: One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat)Rows: 115275 Columns: 79── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (44): committee_id, committee_name...2, report_type, line_number, transaction_id, entity_type, entity_type_desc, unused_contbr_id, ...
dbl  (11): report_year, image_number, file_number, contributor_zip, contribution_receipt_amount, contributor_aggregate_ytd, conduit_comm...
lgl  (22): committee_name...9, recipient_committee_org_type, memo_code_full, candidate_id, candidate_name, candidate_first_name, candida...
dttm  (2): contribution_receipt_date, load_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#clean data

cleaned_data <- md_senate_contributions |>

select( -unused_contbr_id, -committee_name...9, -recipient_committee_org_type, -contributor_suffix, -contributor_street_2, -contributor_id, -memo_code, -memo_code_full, -candidate_id, -candidate_name, -candidate_first_name, -candidate_last_name, -candidate_middle_name, -candidate_prefix, -candidate_suffix, -candidate_office, -candidate_office_full, -candidate_office_state, -candidate_office_state_full, -candidate_office_district, -conduit_committee_id, -conduit_committee_name, -conduit_committee_street1, -conduit_committee_street2, -conduit_committee_city, -conduit_committee_state, -conduit_committee_zip, -donor_committee_name, -national_committee_nonfederal_account, -election_type_full, -increased_limit, -is_individual) |>

rename(committee_name = `committee_name...2`) |>

filter(report_year == 2024)


colnames(cleaned_data)
 [1] "committee_id"                    "committee_name"                  "report_year"                     "report_type"                    
 [5] "image_number"                    "line_number"                     "transaction_id"                  "file_number"                    
 [9] "entity_type"                     "entity_type_desc"                "contributor_prefix"              "contributor_name"               
[13] "recipient_committee_type"        "recipient_committee_designation" "contributor_first_name"          "contributor_middle_name"        
[17] "contributor_last_name"           "contributor_street_1"            "contributor_city"                "contributor_state"              
[21] "contributor_zip"                 "contributor_employer"            "contributor_occupation"          "receipt_type"                   
[25] "receipt_type_desc"               "receipt_type_full"               "contribution_receipt_date"       "contribution_receipt_amount"    
[29] "contributor_aggregate_ytd"       "election_type"                   "fec_election_type_desc"          "fec_election_year"              
[33] "amendment_indicator"             "amendment_indicator_desc"        "schedule_type_full"              "load_date"                      
[37] "original_sub_id"                 "back_reference_transaction_id"   "back_reference_schedule_name"    "filing_form"                    
[41] "link_id"                         "memo_text"                       "two_year_transaction_period"     "schedule_type"                  
[45] "sub_id"                          "pdf_url"                         "line_number_label"              
most_contributions <- cleaned_data |>
select(committee_name, contribution_receipt_amount) |>
group_by(committee_name) |>
summarize(total_contribution = sum(contribution_receipt_amount, na.rm = TRUE)) |>
arrange(desc(total_contribution)) 
cleaned_data |>
  
write_csv("data/cleaned_md_senate_contributions.csv")
data <- read_delim("data/ccl.txt", delim = "|", col_names = FALSE)
Rows: 8568 Columns: 7── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "|"
chr (4): X1, X4, X5, X6
dbl (3): X2, X3, X7
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cleaned_data_ccl <- data |>
rename(candidate_id = X1,candidate_election_year = X2, fec_election_year = X3, committee_id = X4,committee_type = X5,committee_design = X6, linkage_id = X7)

#inner join the committee id and fec election year

joined_data <- inner_join(cleaned_data, cleaned_data_ccl, by = c("committee_id", "fec_election_year"))
Warning: Detected an unexpected many-to-many relationship between `x` and `y`.

#insert the candidate data

candidate_data <- read_delim("data/candidate_data.txt", delim = "|", col_names = FALSE) |>
  select(X1, X2, X3, X5) |> 
 rename(candidate_id = X1, candidate_name = X2, candidate_party = X5)
Rows: 3824 Columns: 30── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "|"
chr  (7): X1, X2, X3, X5, X19, X20, X28
dbl (18): X4, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X26, X27, X29, X30
lgl  (5): X21, X22, X23, X24, X25
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
candidate_joined_data <- inner_join(joined_data, candidate_data, by =c("candidate_id"))

#include only these rows


final_data <- candidate_joined_data |>
select(committee_id, committee_name, report_year, entity_type_desc,contributor_prefix, contributor_name, contributor_first_name, contributor_last_name, contributor_middle_name, contributor_street_1,contributor_city, contributor_state, contributor_zip,contributor_employer, contributor_occupation, contribution_receipt_date, contribution_receipt_amount, contributor_aggregate_ytd,memo_text,pdf_url, candidate_id, candidate_election_year)

Q1: Do more Marylanders donate to in-state or out-of-state candidates?

final_data_states <- final_data |>
  mutate(state = substr(candidate_id, 3, 4))
state_contributions <- final_data_states |>
  group_by(state) |>
  summarise(
    contribution_count = n(),                
    total_contribution = sum(contribution_receipt_amount, na.rm = TRUE)  
  )
maryland_contrib <- state_contributions |>
  filter(state == "MD") |>
  summarise(total_maryland_contrib = sum(total_contribution, na.rm = TRUE))


other_states_contrib <- state_contributions |>
  filter(state != "MD") |>
  summarise(total_other_states_contrib = sum(total_contribution, na.rm = TRUE))


comparison <- data.frame(
  state = c("Maryland", "Other States"),
  total_contribution = c(maryland_contrib$total_maryland_contrib, other_states_contrib$total_other_states_contrib)
)

A: Montana was surprisingly the state with the most contributions — edging Maryland with 202 more contributions Maryland. Maryland received the highest total sum however, with $5,516,562.24. That was $4,074,226.07 more than the next highest state, which was Montana. The other states combined recieved $7,358,970, a little less than two million more than just the state of Maryland.

Q2: Which out-of-state candidates received the most donations and the greatest amount of money in donations?

final_data_no_md <- final_data_states |>
  filter(state != "MD")
contributions_summary <- final_data_no_md |>
  group_by(committee_name) |>
  summarise(
    num_contributions = n(), 
    total_contribution_sum = sum(contribution_receipt_amount, na.rm = TRUE) 
  ) |>
  arrange(committee_name)

A2: Jon Tester, Sherrod Brown and Ruben Gallego were the top three in that order for both contributions recieved and the sum of contributions. The similarity between the three of them is that they were all Democrats running in tightly contested races. Brown was the only to lose his race. Jon Tester also clearly recieved more than the other two — which explains why Montana received so many more out-of-state contributions than other states. Montanans for Tester received 3,679 more contributions and $238,017.96 more in total contributions compared to Friends of Sherrod Brown.

Q3: What are the demographics of people who donate to Maryland races only? To out-of-state races only?

Step 1: find people who donate to ONLY Maryland races.


maryland_only_donors <- final_data_states|>
  group_by(contributor_name) |>                
  summarize(only_md = all(state == "MD")) |>
  filter(only_md) |>                          
  select(contributor_name)                    

full_maryland_only_donors <- final_data_states |>
   filter(contributor_name %in% maryland_only_donors$contributor_name)


zips_full_maryland_only_donors <- full_maryland_only_donors |>
  group_by(contributor_name, contributor_zip) |>
  summarize(
    num_donations = n(),             
    total_donated = sum(contribution_receipt_amount) 
  )|>
  arrange(desc(num_donations))  
`summarise()` has grouped output by 'contributor_name'. You can override using the `.groups` argument.
md_zip_summary <- zips_full_maryland_only_donors |>
  group_by(contributor_zip) |>              
  summarize(
    num_donations = n(),                      
    total_donated = sum(total_donated)      
  ) |>
  arrange(desc(num_donations)) 
all_zcta <- get_acs(
  geography = "zcta",
  variables = c(
 "B01003_001", 
     "B01002_001", #= "median_age",
 "B19013_001", # = "median_income",
  "B02001_002", # = "white_alone",
     "B02001_003", # = "black_alone",
"B02001_004", # = "native_american",
    "B02001_005", #= "asian_alone",
 "B02001_006", # = "hawaiian_pacific",
"B02001_007",  # = "other_race",
    "B02001_008", # = "two_or_more_races"
"B03002_012E"),
  
  year = 2022,
  survey = "acs5",
  output = "wide"
)
Getting data from the 2018-2022 5-year ACS
all_zcta <- all_zcta %>%
  rename(
    total_population = B01003_001E,      
    median_age = B01002_001E,            
    median_income = B19013_001E,        
    white_alone = B02001_002E,          
    black_alone = B02001_003E,         
    native_american = B02001_004E,      
    asian_alone = B02001_005E,          
    hawaiian_pacific = B02001_006E,     
    other_race = B02001_007E,           
    two_or_more_races = B02001_008E,
    hispanic_or_latino = B03002_012E
  )

zips_full_maryland_only_donors <- zips_full_maryland_only_donors |>
  mutate(
    contributor_zip = substr(contributor_zip, 1, 5),
    contributor_zip = sprintf("%05s", contributor_zip)
  )

zips_full_maryland_only_donors <- zips_full_maryland_only_donors |>
  mutate(contributor_zip = as.character(contributor_zip))

merged_data_md <- zips_full_maryland_only_donors %>%
  left_join(all_zcta, by = c("contributor_zip" = "GEOID"))

summarized_md_donor_demos_by_zip <- merged_data_md %>%
  group_by(contributor_zip) |>
  summarize(
    total_population = first(total_population),  
    median_age = first(median_age),             
    median_income = first(median_income),       
    white_alone = first(white_alone),            
    black_alone = first(black_alone),           
    native_american = first(native_american),   
    asian_alone = first(asian_alone),           
    hawaiian_pacific = first(hawaiian_pacific),  
    other_race = first(other_race),  
     hispanic_or_latino = first(hispanic_or_latino),
    two_or_more_races = first(two_or_more_races),
    num_donations = n(),                      
    total_amt_donated = sum(total_donated, na.rm = TRUE) 
  ) |> 
  arrange(desc(num_donations))  |> 
  mutate(
    white_alone_per_capita = white_alone / total_population,
    black_alone_per_capita = black_alone / total_population,
    native_american_per_capita = native_american / total_population,
    asian_alone_per_capita = asian_alone / total_population,
    hawaiian_pacific_per_capita = hawaiian_pacific / total_population,
    other_race_per_capita = other_race / total_population,
     hispanic_or_latino_per_capita = hispanic_or_latino / total_population,
    two_or_more_races_per_capita = two_or_more_races / total_population,
    median_income_per_capita = median_income / total_population
  ) |>
   mutate(
    white_alone_percent = white_alone_per_capita * 100,
    black_alone_percent = black_alone_per_capita * 100,
    native_american_percent = native_american_per_capita * 100,
    asian_alone_percent = asian_alone_per_capita * 100,
    hawaiian_pacific_percent = hawaiian_pacific_per_capita * 100,
    other_race_percent = other_race_per_capita * 100,
    hispanic_or_latino_percent = hispanic_or_latino_per_capita *100,
    two_or_more_races_percent = two_or_more_races_per_capita * 100,
    median_income_per_capita = median_income_per_capita * 100
  ) |>
 mutate(
    amt_donated_per_resident = total_amt_donated / total_population
  ) |>   
  select(where(~ !all(is.na(.))))

cleaner_md_donor_demos <- summarized_md_donor_demos_by_zip |> 
    select(-matches("_per_capita"), -"native_american", -"hawaiian_pacific", -"other_race", -"two_or_more_races", -"white_alone", -"black_alone", -"asian_alone") 
 

#repeat for out of state donors

out_of_state_donors <- final_data_states |>
  filter(state != "MD")

xmaryland_donors <- final_data_states |>
  filter(state == "MD")

out_of_state_only_donors <- out_of_state_donors |>
  filter(!contributor_name %in% xmaryland_donors$contributor_name)  |>
group_by(contributor_name, contributor_zip) |>
  summarize(
    num_donations = n(),            
    total_donated = sum(contribution_receipt_amount)  
  ) |>
  arrange(desc(num_donations))  
`summarise()` has grouped output by 'contributor_name'. You can override using the `.groups` argument.
out_of_state_only_donors <- out_of_state_only_donors |>
  mutate(
    contributor_zip = substr(contributor_zip, 1, 5),
    contributor_zip = sprintf("%05s", contributor_zip)
  )

merged_data_oos <- out_of_state_only_donors|>
  left_join(all_zcta, by = c("contributor_zip" = "GEOID"))

cleaner_oos_donor_demos <- merged_data_oos  |>
   group_by(contributor_zip) |>
  summarize(
    total_population = first(total_population), 
    median_age = first(median_age),              
    median_income = first(median_income),        
    white_alone = first(white_alone),         
    black_alone = first(black_alone),            
    native_american = first(native_american),   
    asian_alone = first(asian_alone),           
    hawaiian_pacific = first(hawaiian_pacific),
    other_race = first(other_race),     
    hispanic_or_latino = first(hispanic_or_latino),
    two_or_more_races = first(two_or_more_races),
    num_donations = n(),                      
    total_amt_donated = sum(total_donated, na.rm = TRUE) 
  ) |> 
  arrange(desc(num_donations))  |> 
  mutate(
   
    white_alone_per_capita = white_alone / total_population,
    black_alone_per_capita = black_alone / total_population,
    native_american_per_capita = native_american / total_population,
    asian_alone_per_capita = asian_alone / total_population,
    hawaiian_pacific_per_capita = hawaiian_pacific / total_population,
    other_race_per_capita = other_race / total_population,
    hispanic_or_latino_per_capita = hispanic_or_latino / total_population,
    two_or_more_races_per_capita = two_or_more_races / total_population,
    median_income_per_capita = median_income / total_population
  ) |>
   mutate(
  
    white_alone_percent = white_alone_per_capita * 100,
    black_alone_percent = black_alone_per_capita * 100,
    native_american_percent = native_american_per_capita * 100,
    asian_alone_percent = asian_alone_per_capita * 100,
    hawaiian_pacific_percent = hawaiian_pacific_per_capita * 100,
    other_race_percent = other_race_per_capita * 100,
    hispanic_or_latino_percent = hispanic_or_latino_per_capita *100,
    two_or_more_races_percent = two_or_more_races_per_capita * 100,
    median_income_per_capita = median_income_per_capita * 100  
  ) 

cleaner_oos_donor_demos <- cleaner_oos_donor_demos |> select(-matches("_per_capita"), -"native_american", -"hawaiian_pacific", -"other_race", -"two_or_more_races", -"white_alone", -"black_alone", -"asian_alone") |> 
  
  mutate(
    amt_donated_per_resident = total_amt_donated / total_population
  ) |>
  select(where(~ !all(is.na(.))))
head(merged_data_oos)
cleaner_oos_donor_demos |>   
  mutate(
    contributor_zip = as.character(contributor_zip),         # Ensure ZIP codes are strings
    contributor_zip = str_pad(contributor_zip, width = 5,    # Pad with leading zeros to ensure 5 digits
                              side = "left", pad = "0"),
    contributor_zip = ifelse(str_detect(contributor_zip, "^\\d{5}$"),  # Keep only valid 5-digit ZIP codes
                             contributor_zip, NA))


cleaner_oos_donor_demos |> arrange(desc(amt_donated_per_resident)) |> write_csv("cleaner_oos_donor_demos.csv")
cleaner_md_donor_demos |> 
  mutate(
    contributor_zip = as.character(contributor_zip),         # Ensure ZIP codes are strings
    contributor_zip = str_pad(contributor_zip, width = 5,    # Pad with leading zeros to ensure 5 digits
                              side = "left", pad = "0"),
    contributor_zip = ifelse(str_detect(contributor_zip, "^\\d{5}$"),  # Keep only valid 5-digit ZIP codes
                             contributor_zip, NA)) |> 
  arrange(desc(amt_donated_per_resident)) |> write_csv("cleaner_md_donor_demos.csv")

cleaner_oos_donor_demos |> write_csv(“cleaner_oos_donor_demos.csv”)

It looks like in this data, the top contributing zip codes are the same for out-of-state and in-state donations. The top out-of-state donors donated much more than the in-state donors, and there were more of them in the top zip codes. A top in-state donor zip code is Anapolis, and doesn’t appear in the top ten zip codes for out-of-state.

As we keep going to answer this question, we should calculate the amount donated per person. We should consider presenting this data with two maps side-by-side.

Question 4: Among top Maryland donors, what professions donate the most money to senate campaigns?

individual_donors  <- final_data_states|>
group_by(contributor_name, contributor_zip, contributor_occupation, contributor_employer) |> 
  summarize(
    num_donations = n(),               
    total_donated = sum(contribution_receipt_amount) 
  ) |>
  arrange(desc(num_donations))  

jobs_to_clean <- individual_donors |> 
  group_by(contributor_occupation) |>
  summarize(
    number_jobs = n(),
    num_donations = n(),  
     total_donations = sum(total_donated)
  ) |> arrange(desc(number_jobs)) |>
write_csv("jobs_to_clean.csv")
boss_to_clean  <- individual_donors |> 
  group_by(contributor_employer) |>
  summarize(
    number_jobs = n(),
    num_donations = n(),  
     total_donations = sum(total_donated)
  ) |> arrange(desc(number_jobs)) |>
write_csv("boss_to_clean.csv")
clean_employer <- read_csv("cleaned_boss.csv")
clean_occupation <- read_csv("clean_jobs.csv")

clean_occupation |> 
  group_by(cleaned_jjobs)  |> 
  summarize(
    number_jobs = n(),
    num_donations = n(),  
     total_donated = sum(total_donations)) |> 
  arrange(desc(number_jobs))
clean_occupation |> 
  group_by(cleaned_jjobs)  |> 
  summarize(
    number_jobs = n(),
    num_donations = n(),  
     total_donated = sum(total_donations)) |> 
  arrange(desc(total_donated))
clean_employer |> 
  group_by(cleaned_boss) |>
  summarize(
    number_jobs = n(),
    num_donations = n(),  
     total_donated = sum(total_donations)
  ) |>
  arrange(desc(number_jobs))

To answer this question, I put a csv of occupations into OpenRefine. We originally planned to limit the data set, but I found that there were only 1,963 job titles in the data set, which seemed like a reasonable number to refine down into a smaller list of jobs.

To do this, I grouped certain jobs into categories like “director” and “executive” – so job titles like “sales director” went into the “director” category.

I think the OpenRefine method may be flawed/not tell us very much because I was only able to get the list down to 1,464 jobs after about 1 hour of refining. I also worry that the categorizing method I used obsures information that might be important – Director of Sales is different from Director of Education, for example, and these jobs were grouped into the same position. And there are such variations in job title that I was not able to categorize all job titles well.

What we can tell from this data is that the most frequently-held jobs among Maryland donors are leadership/management positions, so likely well-paid. The most money was donated by people who are not employed or retired.

We should consider repeating this question with information about employers. I’ve started to do this in OpenRefine, but this will take longer, because there are more than 7,000 different employers. However, many of these seem to be able to be reconciled in OpenRefine. And I’m interested to see how many donations/how much money was donated by people who work for top employers in Maryland, like Johns Hopkins.

#Question 5: What’s the makeup of donations received by Hogan and Alsobrooks? What percentage of their overall donations were large amounts of money (to be defined, but > $1000, for example) vs small amounts?

hogan_filtered_data <- final_data |>
  filter(committee_name %in% c("HOGAN FOR MARYLAND INC."))
large_donation_threshold <- 1000

hogan_filtered_data |> mutate( donation_category = ifelse(contribution_receipt_amount > large_donation_threshold, "Large", "Small") )

hogan_category_totals <- hogan_filtered_data |>
  mutate(donation_category = ifelse(contribution_receipt_amount > large_donation_threshold, "Large", "Small")) |>
  group_by(donation_category) |>
  summarise(total_amount = sum(contribution_receipt_amount, na.rm = TRUE), .groups = "drop")

hogan_category_totals %>% mutate( percentage = total_amount / sum(total_amount) * 100 )

hogan_category_totals
alsobrooks_filtered_data <- final_data |>
  filter(committee_name %in% c("ALSOBROOKS FOR SENATE"))

large_donation_threshold <- 1000

alsobrooks_category_totals <- alsobrooks_filtered_data |>
  mutate(donation_category = ifelse(contribution_receipt_amount > large_donation_threshold, "Large", "Small")) |>
  group_by(donation_category) |>
  summarise(total_amount = sum(contribution_receipt_amount, na.rm = TRUE), .groups = "drop")

alsobrooks_category_totals  %>% mutate( percentage = total_amount / sum(total_amount) * 100 )

alsobrooks_category_totals

A5: Here we see that Hogan receives signficantly more money in general in comparison to Alsobrooks. Additionally, Hogan receives many more large donations (in excess of millions) in comparison to Alsobrooks. Whereas, the small donations are around 600,000 more. This shows that eithe Hogan has a larger “fan base” of people who donate to him or perhaps his electorate is just wealthier. Additionally,it would be interesting to note whether Hogan has a higher amount of donations because he was known before.

---
title: "R Notebook"
output: html_notebook
---

R Notebook
Questions, refined:

Do more Marylanders donate to in-state or out-of-state candidates? Which out-of-state candidates received the most donations and the greatest amount of money in donations? What are the demographics of people who donate to Maryland races only? To out-of-state races only? (using voter data OR using census data to see if those donors are clustered in specific locations, and, if they are, what the makeups of those locations are) Among top Maryland donors, what professions donate the most money to senate campaigns? What’s the makeup of donations received by Hogan and Alsobrooks? What percentage of their overall donations were large amounts of money (to be defined, but > $1000, for example) vs small amounts? Which party tends to donate more to out-of-state races?

We refined our dataset further by deciding to focus on individual donations made to senate races by donors in Maryland. Using the FEC website, we downloaded a dataset that met those parameters. We cleaned the data by removing unnecessary columns and renaming the columns where needed. We decided to limit our dataset to donations made in 2024 to ensure that the donations were made after the primary and before the general election. We hoped this would limit the number of candidates receiving donations, since most donations will go toward one of two parties in a certain race.

We also cleaned the data-set, removing numerous columns that were N/A or that were unnecessary to our project. We also had to rename one column and then filter to only the report year of 2024 to limit the donations. To answer our first question and questions, we’ll need to separate the out-of-state candidates from in-state candidates, but this should be easy because we’ll just need to filter out the Hogan and Alsobrooks committees to isolate the out-of-state candidates receiving donations. To answer our third question, we’ll need to either call in demographic data from the census or identify specific locations with clusters of donations, and look up information about those locations using census.gov. Question #4 asks about the professions of donors. One obvious challenge here is that professions and titles are often spelled or worded differently, even if they describe the same job– in this dataset, there are 2,738 different professions listed. To try to combat this, we will use open refine to reduce the number of occupations with slight variations in spelling or wording. We will then limit the data set to the top 1000 donations to make the analysis more manageable. We can also make some broad observations about the data even without cleaning the professions– for example, it’s obvious that the most donations come from people who say they are “not employed” (45568) or “retired” (29424). The next highest number of donations is more than 20,000 donations less than the number of donations from retired people. There are 207 committees that received donations in our dataset. Originally, we thought that we would use open refine to sort the committees by candidate name, because we thought that most candidates would have multiple committees. However, we found that this was actually pretty uncommon, and we weren’t able to match any names using open refine, once we uploaded our csv. For question 5, we need to do some extra research to decide how to define large donations. One idea is to find the average or median donation, and use that number as the dividing line between small and large donations. Another idea is to see if there’s an agreed-upon definition by people who work on campaigns or political scientists. We have one extra question (#6) that we would like to answer, but it involves party-level analysis that we can’t do because we don’t have the political parties of the donors in this data, and we don’t have anything in this data that indicates the party of each candidate. To answer this question, we could join this data with data from Act Blue and Win Red to identify the parties of donors. We can also probably find a dataset with the committee names and parties for all senate candidates, and join that dataset with ours.

# Turn off scientific notation

```{r}
options(scipen=999)
library(tidyverse)
library(lubridate)


```

```{r}
install.packages('tidycensus')
```

```{r}
library(tidycensus)
```



```{r}
md_senate_contributions <- read_csv("data/md_senate_contributions.csv") |>


print(column_names)
```

#clean data
```{r}
cleaned_data <- md_senate_contributions |>

select( -unused_contbr_id, -committee_name...9, -recipient_committee_org_type, -contributor_suffix, -contributor_street_2, -contributor_id, -memo_code, -memo_code_full, -candidate_id, -candidate_name, -candidate_first_name, -candidate_last_name, -candidate_middle_name, -candidate_prefix, -candidate_suffix, -candidate_office, -candidate_office_full, -candidate_office_state, -candidate_office_state_full, -candidate_office_district, -conduit_committee_id, -conduit_committee_name, -conduit_committee_street1, -conduit_committee_street2, -conduit_committee_city, -conduit_committee_state, -conduit_committee_zip, -donor_committee_name, -national_committee_nonfederal_account, -election_type_full, -increased_limit, -is_individual) |>

rename(committee_name = `committee_name...2`) |>

filter(report_year == 2024)


colnames(cleaned_data)
```

```{r}
most_contributions <- cleaned_data |>
select(committee_name, contribution_receipt_amount) |>
group_by(committee_name) |>
summarize(total_contribution = sum(contribution_receipt_amount, na.rm = TRUE)) |>
arrange(desc(total_contribution)) 
cleaned_data |>
  
write_csv("data/cleaned_md_senate_contributions.csv")
data <- read_delim("data/ccl.txt", delim = "|", col_names = FALSE)

cleaned_data_ccl <- data |>
rename(candidate_id = X1,candidate_election_year = X2, fec_election_year = X3, committee_id = X4,committee_type = X5,committee_design = X6, linkage_id = X7)
```

#inner join the committee id and fec election year 

```{r}
joined_data <- inner_join(cleaned_data, cleaned_data_ccl, by = c("committee_id", "fec_election_year"))
```

#insert the candidate data
```{r}
candidate_data <- read_delim("data/candidate_data.txt", delim = "|", col_names = FALSE) |>
  select(X1, X2, X3, X5) |> 
 rename(candidate_id = X1, candidate_name = X2, candidate_party = X5)

candidate_joined_data <- inner_join(joined_data, candidate_data, by =c("candidate_id"))
```


#include only these rows 

```{r}

final_data <- candidate_joined_data |>
select(committee_id, committee_name, report_year, entity_type_desc,contributor_prefix, contributor_name, contributor_first_name, contributor_last_name, contributor_middle_name, contributor_street_1,contributor_city, contributor_state, contributor_zip,contributor_employer, contributor_occupation, contribution_receipt_date, contribution_receipt_amount, contributor_aggregate_ytd,memo_text,pdf_url, candidate_id, candidate_election_year)
```

Q1: Do more Marylanders donate to in-state or out-of-state candidates? 

```{r}
final_data_states <- final_data |>
  mutate(state = substr(candidate_id, 3, 4))
```

```{r}
state_contributions <- final_data_states |>
  group_by(state) |>
  summarise(
    contribution_count = n(),                
    total_contribution = sum(contribution_receipt_amount, na.rm = TRUE)  
  )
```

```{r}
maryland_contrib <- state_contributions |>
  filter(state == "MD") |>
  summarise(total_maryland_contrib = sum(total_contribution, na.rm = TRUE))


other_states_contrib <- state_contributions |>
  filter(state != "MD") |>
  summarise(total_other_states_contrib = sum(total_contribution, na.rm = TRUE))


comparison <- data.frame(
  state = c("Maryland", "Other States"),
  total_contribution = c(maryland_contrib$total_maryland_contrib, other_states_contrib$total_other_states_contrib)
)
```


A: Montana was surprisingly the state with the most contributions --- edging Maryland with 202 more contributions Maryland. Maryland received the highest total sum however, with $5,516,562.24. That was $4,074,226.07 more than the next highest state, which was Montana. The other states combined recieved $7,358,970, a little less than two million more than just the state of Maryland. 


Q2: Which out-of-state candidates received the most donations and the greatest amount of money in donations? 

```{r}
final_data_no_md <- final_data_states |>
  filter(state != "MD")
```

```{r}
contributions_summary <- final_data_no_md |>
  group_by(committee_name) |>
  summarise(
    num_contributions = n(), 
    total_contribution_sum = sum(contribution_receipt_amount, na.rm = TRUE) 
  ) |>
  arrange(committee_name)
```

A2: Jon Tester, Sherrod Brown and Ruben Gallego were the top three in that order for both contributions recieved and the sum of contributions. The similarity between the three of them is that they were all Democrats running in tightly contested races. Brown was the only to lose his race. Jon Tester also clearly recieved more than the other two --- which explains why Montana received so many more out-of-state contributions than other states. Montanans for Tester received 3,679 more contributions and $238,017.96 more in total contributions compared to Friends of Sherrod Brown.

Q3: What are the demographics of people who donate to Maryland races only? To out-of-state races only?

Step 1: find people who donate to ONLY Maryland races.
```{r}

maryland_only_donors <- final_data_states|>
  group_by(contributor_name) |>                
  summarize(only_md = all(state == "MD")) |>
  filter(only_md) |>                          
  select(contributor_name)                    

full_maryland_only_donors <- final_data_states |>
   filter(contributor_name %in% maryland_only_donors$contributor_name)


zips_full_maryland_only_donors <- full_maryland_only_donors |>
  group_by(contributor_name, contributor_zip) |>
  summarize(
    num_donations = n(),             
    total_donated = sum(contribution_receipt_amount) 
  )|>
  arrange(desc(num_donations))  

md_zip_summary <- zips_full_maryland_only_donors |>
  group_by(contributor_zip) |>              
  summarize(
    num_donations = n(),                      
    total_donated = sum(total_donated)      
  ) |>
  arrange(desc(num_donations)) 

```
```{r}
all_zcta <- get_acs(
  geography = "zcta",
  variables = c(
 "B01003_001", 
     "B01002_001", #= "median_age",
 "B19013_001", # = "median_income",
  "B02001_002", # = "white_alone",
     "B02001_003", # = "black_alone",
"B02001_004", # = "native_american",
    "B02001_005", #= "asian_alone",
 "B02001_006", # = "hawaiian_pacific",
"B02001_007",  # = "other_race",
    "B02001_008", # = "two_or_more_races"
"B03002_012E"),
  
  year = 2022,
  survey = "acs5",
  output = "wide"
)

all_zcta <- all_zcta %>%
  rename(
    total_population = B01003_001E,      
    median_age = B01002_001E,            
    median_income = B19013_001E,        
    white_alone = B02001_002E,          
    black_alone = B02001_003E,         
    native_american = B02001_004E,      
    asian_alone = B02001_005E,          
    hawaiian_pacific = B02001_006E,     
    other_race = B02001_007E,           
    two_or_more_races = B02001_008E,
    hispanic_or_latino = B03002_012E
  )

```
```{r}

zips_full_maryland_only_donors <- zips_full_maryland_only_donors |>
  mutate(
    contributor_zip = substr(contributor_zip, 1, 5),
    contributor_zip = sprintf("%05s", contributor_zip)
  )

zips_full_maryland_only_donors <- zips_full_maryland_only_donors |>
  mutate(contributor_zip = as.character(contributor_zip))

merged_data_md <- zips_full_maryland_only_donors %>%
  left_join(all_zcta, by = c("contributor_zip" = "GEOID"))

summarized_md_donor_demos_by_zip <- merged_data_md %>%
  group_by(contributor_zip) |>
  summarize(
    total_population = first(total_population),  
    median_age = first(median_age),             
    median_income = first(median_income),       
    white_alone = first(white_alone),            
    black_alone = first(black_alone),           
    native_american = first(native_american),   
    asian_alone = first(asian_alone),           
    hawaiian_pacific = first(hawaiian_pacific),  
    other_race = first(other_race),  
     hispanic_or_latino = first(hispanic_or_latino),
    two_or_more_races = first(two_or_more_races),
    num_donations = n(),                      
    total_amt_donated = sum(total_donated, na.rm = TRUE) 
  ) |> 
  arrange(desc(num_donations))  |> 
  mutate(
    white_alone_per_capita = white_alone / total_population,
    black_alone_per_capita = black_alone / total_population,
    native_american_per_capita = native_american / total_population,
    asian_alone_per_capita = asian_alone / total_population,
    hawaiian_pacific_per_capita = hawaiian_pacific / total_population,
    other_race_per_capita = other_race / total_population,
     hispanic_or_latino_per_capita = hispanic_or_latino / total_population,
    two_or_more_races_per_capita = two_or_more_races / total_population,
    median_income_per_capita = median_income / total_population
  ) |>
   mutate(
    white_alone_percent = white_alone_per_capita * 100,
    black_alone_percent = black_alone_per_capita * 100,
    native_american_percent = native_american_per_capita * 100,
    asian_alone_percent = asian_alone_per_capita * 100,
    hawaiian_pacific_percent = hawaiian_pacific_per_capita * 100,
    other_race_percent = other_race_per_capita * 100,
    hispanic_or_latino_percent = hispanic_or_latino_per_capita *100,
    two_or_more_races_percent = two_or_more_races_per_capita * 100,
    median_income_per_capita = median_income_per_capita * 100
  ) |>
 mutate(
    amt_donated_per_resident = total_amt_donated / total_population
  ) |>   
  select(where(~ !all(is.na(.))))


```
```{r}

cleaner_md_donor_demos <- summarized_md_donor_demos_by_zip |> 
    select(-matches("_per_capita"), -"native_american", -"hawaiian_pacific", -"other_race", -"two_or_more_races", -"white_alone", -"black_alone", -"asian_alone") 
 


```
#repeat for out of state donors 

```{r}
out_of_state_donors <- final_data_states |>
  filter(state != "MD")

xmaryland_donors <- final_data_states |>
  filter(state == "MD")

out_of_state_only_donors <- out_of_state_donors |>
  filter(!contributor_name %in% xmaryland_donors$contributor_name)  |>
group_by(contributor_name, contributor_zip) |>
  summarize(
    num_donations = n(),            
    total_donated = sum(contribution_receipt_amount)  
  ) |>
  arrange(desc(num_donations))  


out_of_state_only_donors <- out_of_state_only_donors |>
  mutate(
    contributor_zip = substr(contributor_zip, 1, 5),
    contributor_zip = sprintf("%05s", contributor_zip)
  )


```
```{r}

merged_data_oos <- out_of_state_only_donors|>
  left_join(all_zcta, by = c("contributor_zip" = "GEOID"))

cleaner_oos_donor_demos <- merged_data_oos  |>
   group_by(contributor_zip) |>
  summarize(
    total_population = first(total_population), 
    median_age = first(median_age),              
    median_income = first(median_income),        
    white_alone = first(white_alone),         
    black_alone = first(black_alone),            
    native_american = first(native_american),   
    asian_alone = first(asian_alone),           
    hawaiian_pacific = first(hawaiian_pacific),
    other_race = first(other_race),     
    hispanic_or_latino = first(hispanic_or_latino),
    two_or_more_races = first(two_or_more_races),
    num_donations = n(),                      
    total_amt_donated = sum(total_donated, na.rm = TRUE) 
  ) |> 
  arrange(desc(num_donations))  |> 
  mutate(
   
    white_alone_per_capita = white_alone / total_population,
    black_alone_per_capita = black_alone / total_population,
    native_american_per_capita = native_american / total_population,
    asian_alone_per_capita = asian_alone / total_population,
    hawaiian_pacific_per_capita = hawaiian_pacific / total_population,
    other_race_per_capita = other_race / total_population,
    hispanic_or_latino_per_capita = hispanic_or_latino / total_population,
    two_or_more_races_per_capita = two_or_more_races / total_population,
    median_income_per_capita = median_income / total_population
  ) |>
   mutate(
  
    white_alone_percent = white_alone_per_capita * 100,
    black_alone_percent = black_alone_per_capita * 100,
    native_american_percent = native_american_per_capita * 100,
    asian_alone_percent = asian_alone_per_capita * 100,
    hawaiian_pacific_percent = hawaiian_pacific_per_capita * 100,
    other_race_percent = other_race_per_capita * 100,
    hispanic_or_latino_percent = hispanic_or_latino_per_capita *100,
    two_or_more_races_percent = two_or_more_races_per_capita * 100,
    median_income_per_capita = median_income_per_capita * 100  
  ) 

cleaner_oos_donor_demos <- cleaner_oos_donor_demos |> select(-matches("_per_capita"), -"native_american", -"hawaiian_pacific", -"other_race", -"two_or_more_races", -"white_alone", -"black_alone", -"asian_alone") |> 
  
  mutate(
    amt_donated_per_resident = total_amt_donated / total_population
  ) |>
  select(where(~ !all(is.na(.))))


```
```{r}
head(merged_data_oos)
```


```{r}
cleaner_oos_donor_demos |>   
  mutate(
    contributor_zip = as.character(contributor_zip),         # Ensure ZIP codes are strings
    contributor_zip = str_pad(contributor_zip, width = 5,    # Pad with leading zeros to ensure 5 digits
                              side = "left", pad = "0"),
    contributor_zip = ifelse(str_detect(contributor_zip, "^\\d{5}$"),  # Keep only valid 5-digit ZIP codes
                             contributor_zip, NA))


cleaner_oos_donor_demos |> arrange(desc(amt_donated_per_resident)) |> write_csv("cleaner_oos_donor_demos.csv")

```

```{r}
cleaner_md_donor_demos |> 
  mutate(
    contributor_zip = as.character(contributor_zip),         # Ensure ZIP codes are strings
    contributor_zip = str_pad(contributor_zip, width = 5,    # Pad with leading zeros to ensure 5 digits
                              side = "left", pad = "0"),
    contributor_zip = ifelse(str_detect(contributor_zip, "^\\d{5}$"),  # Keep only valid 5-digit ZIP codes
                             contributor_zip, NA)) |> 
  arrange(desc(amt_donated_per_resident)) |> write_csv("cleaner_md_donor_demos.csv")
```


cleaner_oos_donor_demos |> write_csv("cleaner_oos_donor_demos.csv")



It looks like in this data, the top contributing zip codes are the same for out-of-state and in-state donations. The top out-of-state donors donated much more than the in-state donors, and there were more of them in the top zip codes. A top in-state donor zip code is Anapolis, and doesn't appear in the top ten zip codes for out-of-state.

As we keep going to answer this question, we should calculate the amount donated per person. We should consider presenting this data with two maps side-by-side. 


# Question 4: Among top Maryland donors, what professions donate the most money to senate campaigns? 

```{r}
individual_donors  <- final_data_states|>
group_by(contributor_name, contributor_zip, contributor_occupation, contributor_employer) |> 
  summarize(
    num_donations = n(),               
    total_donated = sum(contribution_receipt_amount) 
  ) |>
  arrange(desc(num_donations))  

jobs_to_clean <- individual_donors |> 
  group_by(contributor_occupation) |>
  summarize(
    number_jobs = n(),
    num_donations = n(),  
     total_donations = sum(total_donated)
  ) |> arrange(desc(number_jobs)) |>
write_csv("jobs_to_clean.csv")

```

```{r}
boss_to_clean  <- individual_donors |> 
  group_by(contributor_employer) |>
  summarize(
    number_jobs = n(),
    num_donations = n(),  
     total_donations = sum(total_donated)
  ) |> arrange(desc(number_jobs)) |>
write_csv("boss_to_clean.csv")
```


```{r}
clean_employer <- read_csv("cleaned_boss.csv")
clean_occupation <- read_csv("clean_jobs.csv")

clean_occupation |> 
  group_by(cleaned_jjobs)  |> 
  summarize(
    number_jobs = n(),
    num_donations = n(),  
     total_donated = sum(total_donations)) |> 
  arrange(desc(number_jobs))

```
```{r}
clean_occupation |> 
  group_by(cleaned_jjobs)  |> 
  summarize(
    number_jobs = n(),
    num_donations = n(),  
     total_donated = sum(total_donations)) |> 
  arrange(desc(total_donated))

```


```{r}
clean_employer |> 
  group_by(cleaned_boss) |>
  summarize(
    number_jobs = n(),
    num_donations = n(),  
     total_donated = sum(total_donations)
  ) |>
  arrange(desc(number_jobs))
```


To answer this question, I put a csv of occupations into OpenRefine. We originally planned to limit the data set, but I found that there were only 1,963 job titles in the data set, which seemed like a reasonable number to refine down into a smaller list of jobs. 

To do this, I grouped certain jobs into categories like "director" and "executive" -- so job titles like "sales director" went into the "director" category. 

I think the OpenRefine method may be flawed/not tell us very much because I was only able to get the list down to 1,464 jobs after about 1 hour of refining. I also worry that the categorizing method I used obsures information that might be important -- Director of Sales is different from Director of Education, for example, and these jobs were grouped into the same position. And there are such variations in job title that I was not able to categorize all job titles well. 

What we can tell from this data is that the most frequently-held jobs among Maryland donors are leadership/management positions, so likely well-paid. The most money was donated by people who are not employed or retired. 

We should consider repeating this question with information about employers. I've started to do this in OpenRefine, but this will take longer, because there are more than 7,000 different employers. However, many of these seem to be able to be reconciled in OpenRefine. And I'm interested to see how many donations/how much money was donated by people who work for top employers in Maryland, like Johns Hopkins. 


#Question 5: What’s the makeup of donations received by Hogan and Alsobrooks? What percentage of their overall donations were large amounts of money (to be defined, but > $1000, for example) vs small amounts?

```{r}
hogan_filtered_data <- final_data |>
  filter(committee_name %in% c("HOGAN FOR MARYLAND INC."))
large_donation_threshold <- 1000

hogan_filtered_data |> mutate( donation_category = ifelse(contribution_receipt_amount > large_donation_threshold, "Large", "Small") )

hogan_category_totals <- hogan_filtered_data |>
  mutate(donation_category = ifelse(contribution_receipt_amount > large_donation_threshold, "Large", "Small")) |>
  group_by(donation_category) |>
  summarise(total_amount = sum(contribution_receipt_amount, na.rm = TRUE), .groups = "drop")

hogan_category_totals %>% mutate( percentage = total_amount / sum(total_amount) * 100 )

hogan_category_totals
```

```{r}
alsobrooks_filtered_data <- final_data |>
  filter(committee_name %in% c("ALSOBROOKS FOR SENATE"))

large_donation_threshold <- 1000

alsobrooks_category_totals <- alsobrooks_filtered_data |>
  mutate(donation_category = ifelse(contribution_receipt_amount > large_donation_threshold, "Large", "Small")) |>
  group_by(donation_category) |>
  summarise(total_amount = sum(contribution_receipt_amount, na.rm = TRUE), .groups = "drop")

alsobrooks_category_totals  %>% mutate( percentage = total_amount / sum(total_amount) * 100 )

alsobrooks_category_totals
```
A5: Here we see that Hogan receives signficantly more money in general in comparison to Alsobrooks. Additionally, Hogan receives many more large donations (in excess of millions) in comparison to Alsobrooks. Whereas, the small donations are around 600,000 more. This shows that eithe Hogan has a larger "fan base" of people who donate to him or perhaps his electorate is just wealthier. Additionally,it would be interesting to note whether Hogan has a higher amount of donations because he was known before. 
